Applying a Hybrid Query Translation Method to 
Japanese/English Cross-Language Patent Retrieval 
Abstract

This paper applies an existing query trans- 
lation method to cross-language patent re- 
trieval. In our method, multiple dictionar- 
ies are used to derive all possible transla- 
tions for an input query, and collocational 
statistics are used to resolve translation am- 
biguity. We used Japanese/English parallel 
patent abstracts to perform comparative exper- 
iments, where our method outperformed a sim- 
ple dictionary-based query translation method, 
and achieved 76% of monolingual retrieval in 
terms of average precision. 



1 Introduction 

Since 1978, JAPIO (Japan Patent Information 
Organization) has operated PATOLIS, which is 
one of the first on-line patent retrieval services 
in Japan, and currently provides clients (i.e., 
8,000 Japanese companies) with patent infor- 
mation from 62 countries and 5 international 
organizations. At the same time, since a patent
obtained in a single country can be protected in
multiple countries simultaneously, it is likely
that users are interested in retrieving patent
information across languages. Motivated by this
background, JAPIO manually summarizes each
patent document submitted in Japan into
approximately 400 characters, and translates
the summaries into English; these abstracts are
provided on the PAJ (Patent Abstract of Japan)
CD-ROMs¹.

In this paper, we target cross-language in- 
formation retrieval (CLIR) in the context of 
patent retrieval, and evaluate its effectiveness 
using Japanese/English patent abstracts on 
PAJ CD-ROMs. 

In brief, existing CLIR systems are classified
into three approaches: (a) translating queries
into the document language, (b) translating
documents into the query language, and
(c) representing both queries and documents in
a language-independent space [15]. However,
since developing a CLIR system is expensive, we
used the CLIR system proposed by Fujii and
Ishikawa, which follows the first approach.

This system was developed in part for the
NACSIS test collection, which consists of
39 Japanese queries and approximately
330,000 technical abstracts in Japanese and
English. However, since patent information
usually includes technical terms, it is expected
that this system will also perform reasonably
well on patent abstracts.



1 Copyright by Japan Patent Office. 



2 System Description 



Figure 1 depicts the overall design of our CLIR
system, in which we combine a query translation
module and an IR engine for monolingual
retrieval. Unlike the original system proposed
by Fujii and Ishikawa [6], which targeted the
NACSIS collection, we use the JAPIO collection
for the target documents. Here, the JAPIO
collection is a subset of PAJ CD-ROMs. We will
elaborate on this collection in Section 3. In
this section, we briefly explain the retrieval
process based on Figure 1.

First, the query translation module translates
the source language query into the target
language. For this purpose, a hybrid method
integrating multiple resources is used. More
precisely, the EDR technical/general
dictionaries are used to derive all possible
translation candidates for words and phrases
included in the source query. In addition, for
words unlisted in the dictionaries,
transliteration is performed to identify
phonetic equivalents in the target language.

Then, bi-gram statistics extracted from
NACSIS documents in the target language are
used to resolve the translation ambiguity.
Ideally, bi-gram statistics should be extracted
from the JAPIO collection. However, since the
number of documents in this collection is
relatively small compared with the NACSIS
collection (see Section 3), we used the NACSIS
collection to avoid the data sparseness problem.
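
The disambiguation step can be illustrated with
a minimal sketch. It assumes that each source
word has a list of candidate translations and
that bi-gram counts from a target-language
corpus are available. The function names, the
toy data format and the scoring function (a sum
of log-smoothed bi-gram counts) are assumptions
made for illustration; the scoring actually used
by the system is not specified in this paper.

```python
from itertools import product
from math import log


def sequence_score(words, bigram_counts):
    """Score a candidate translation by summing log(1 + count) over
    adjacent word pairs. This scoring function is an assumption."""
    return sum(
        log(1 + bigram_counts.get((left, right), 0))
        for left, right in zip(words, words[1:])
    )


def k_best_translations(candidates, bigram_counts, k=1):
    """Return the k highest-scoring combinations of candidate translations.

    candidates: one list of target-language candidates per source word.
    bigram_counts: dict mapping (word, next_word) to a corpus frequency.
    The search is exhaustive, so it only suits short queries.
    """
    combinations = product(*candidates)
    ranked = sorted(
        combinations,
        key=lambda words: sequence_score(words, bigram_counts),
        reverse=True,
    )
    return [list(words) for words in ranked[:k]]
```

With hypothetical candidates such as
[["car", "vehicle"], ["navigation", "voyage"]],
a combination whose adjacent words co-occur
frequently in the corpus is ranked above
combinations that never co-occur.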

Since our system is bidirectional between 
Japanese and English, we tokenize documents 
with different methods, depending on their lan- 
guage. For English documents, the tokeniza- 
tion involves eliminating stopwords and iden- 
tifying root forms for inflected content words. 
For this purpose, we use WordNet [4], which
contains a stopword list and correspondences 
between inflected words and their root form. 

On the other hand, we segment Japanese 
documents into lexical units using the ChaSen 
morphological analyzer [12], which has
commonly been used in Japanese NLP research,
and extract content words based on
their part-of-speech information. 
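
As a rough sketch of the English tokenization
described above, the following code removes
stopwords and maps inflected words to root
forms. The small STOPWORDS and ROOT_FORMS tables
are hypothetical stand-ins for the WordNet
resources, and the function name is likewise an
assumption.

```python
import re

# Hypothetical stand-ins for the WordNet stopword list and the
# inflected-word-to-root correspondences; entries are illustrative only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "is", "are"}
ROOT_FORMS = {
    "eliminating": "eliminate",
    "burning": "burn",
    "wastes": "waste",
    "systems": "system",
}


def tokenize_english(text):
    """Lowercase the text, drop stopwords, and replace each remaining
    word by its root form when one is listed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [ROOT_FORMS.get(word, word) for word in words if word not in STOPWORDS]
```

Words that are absent from the root-form table
are kept unchanged, so unknown technical terms
survive as indexing terms.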

Figure 1: The overall design of our
cross-language patent retrieval system.



Second, the IR engine searches the JAPIO 
collection for documents relevant to the trans- 
lated query, and sorts them according to the 
degree of relevance, in descending order. Our 
IR engine is based on the vector space model, 
in which the similarity between the query and 
each document (i.e., the degree of relevance of 
each document) is computed as the cosine of 
the angle between their associated vectors. We 
use the notion of TF-IDF for term weighting. 
Among a number of variations of term weighting
methods [16, 18], we tentatively use the
formulae shown in Equation (1).



\[
\begin{aligned}
\mathit{TF} &= 1 + \log(f_{t,d}) \\
\mathit{IDF} &= \log(N / n_t)
\end{aligned}
\tag{1}
\]



Here, f_{t,d} denotes the frequency with which
term t appears in document d, and n_t denotes
the number of documents containing term t. N is
the total number of documents in the collection.
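
The ranking step can be sketched as follows,
assuming the weighting TF = 1 + log(f_{t,d})
and IDF = log(N / n_t) with the symbols defined
above. The function names and the in-memory
document format (a dict from document ID to a
token list) are assumptions for illustration,
not the system's actual implementation.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vector(tokens, doc_freq, n_docs):
    """Build a sparse TF-IDF vector, assuming TF = 1 + log(f) and
    IDF = log(N / n_t). Terms unseen in the collection are skipped."""
    counts = Counter(tokens)
    return {
        term: (1 + log(freq)) * log(n_docs / doc_freq[term])
        for term, freq in counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(weight * weight for weight in u.values()))
    norm_v = sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_tokens, docs):
    """Return (doc_id, similarity) pairs sorted by descending similarity."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    query_vec = tfidf_vector(query_tokens, doc_freq, n_docs)
    scored = [
        (doc_id, cosine(query_vec, tfidf_vector(tokens, doc_freq, n_docs)))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Note that under this weighting a term occurring
in every document receives an IDF of zero and
therefore does not contribute to the similarity.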

For the indexing process, we first tokenize 
documents as explained above (i.e., we use 
WordNet and ChaSen for English and Japanese 
documents, respectively), and then perform
word-based indexing. That is, we use each
content word as a single indexing term.

Finally, since retrieved documents are not in 
the user's native language, we optionally use 
a machine translation system to enhance read- 
ability of retrieved documents. 



3 Experimentation 

Since no test collection for Japanese/English 
patent retrieval is available to the public, we 
produced our test collection (i.e., the JAPIO 
collection), which consists of three Japanese 
queries and Japanese/English comparable ab- 
stracts. 

Each query was produced manually and consists
of a description and a narrative. The three
queries correspond to different domains, i.e.,
electrical engineering, mechanical engineering
and chemistry. Figure 2 shows the three query
descriptions in the second column.

In conventional test collections, relevance as- 
sessment is usually performed based on the 
pooling method [17], which first pools
candidates for relevant documents using
multiple retrieval systems. However, since in
our case only the one system described in
Section 2 is currently available, a different
production method was needed.

To put it more precisely, for each query (do- 
main), target documents were first collected 
based on the IPC classification number, from 
PAJ CD-ROMs in 1993-1998. Then, for each 
query, three professional human searchers, who 
were allowed to enhance queries based on the- 
sauri and their introspection, searched the tar- 
get documents for relevant documents. 

Thus, in practice, the JAPIO collection
consists of three different document
collections, one for each query. In Figure 2,
the third and fourth columns denote the number
of relevant documents and the total number of
target documents for each query.

We compared the following methods:

(a) Japanese-English CLIR, where all possible
translations derived from the EDR dictionaries
and the transliteration method were used as
query terms (JEALL),

(b) Japanese-English CLIR, where disambiguation
based on bi-gram statistics was performed, and
the k-best translations were used as query
terms (JEDIS),

(c) Japanese-Japanese monolingual IR (JJ).

Here, we empirically set k = 1. Although the
performance of JEDIS did not differ
significantly as long as k was small (e.g.,
k = 5), we achieved the best performance with
k = 1.

Figure 3 shows recall-precision curves for the
above three methods, where JEDIS generally
outperformed JEALL, and JJ generally
outperformed both JEALL and JEDIS, regardless
of the recall. The difference between JEALL and
JEDIS is attributed to the fact that JEDIS
resolved translation ambiguity based on bi-gram
statistics extracted from the NACSIS
collection. Thus, we can conclude that the use
of bi-gram statistics (even when extracted from
a collection other than the JAPIO collection)
was effective for the query translation.

Table 1 shows the non-interpolated average
precision values, averaged over the three
queries, for each method. This table shows that
JJ outperformed JEALL and JEDIS, JEDIS
outperformed JEALL, and the average precision
value for JEDIS was 76% of that obtained
with JJ.

These results are also observable in existing
CLIR experiments using the TREC and NACSIS
collections. Thus, we conclude that our
cross-language patent retrieval system is
comparable in performance with systems for
newspaper articles and technical abstracts.

However, we could not conduct statistical
testing, which investigates whether the
difference in average precision is meaningful
or simply due to chance, because the number of
queries is small. Experiments using a larger
number of queries remain to be conducted.
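
The non-interpolated average precision used in
this section can be computed as in the sketch
below. The function names and the data format
(a ranked list of document IDs and a set of
relevant IDs per query) are assumptions for
illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query: the mean of the
    precision values at the rank of each relevant document retrieved,
    divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

The relative performance of a cross-language
method is then the ratio of its averaged value
to that of the monolingual run.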



4 Conclusion 

In this paper, we explored Japanese/English
cross-language patent retrieval. For this
purpose, we used an existing cross-language IR
system relying on a hybrid query translation
method, and evaluated its effectiveness using
Japanese queries and English patent abstracts.
The experimental results paralleled those of
existing experiments. That is, we found that
resolving translation ambiguity was effective
for the query translation, and that the average
precision value for cross-language IR was
approximately 76% of that obtained with
monolingual IR. Future work will include
qualitative/quantitative analyses based on a
larger number of queries.

IPC          Description                                          #Relevant  #Documents
electronics  GPS car navigation system based on VICS              930        7,526
mechanics    eliminating dioxin in burning solid wastes           ...        ...
...          antibacterial plastic combining inorganic materials  473        ...

Figure 2: Query descriptions in the JAPIO collection.

Figure 3: Recall-precision curves for different
methods.